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Abstract — Gaussian and quadratic approximations of message 
passing algorithms on graphs have attracted considerable recent 
attention due to their computational simplicity, analytic tractabil- 
ity, and wide applicability in optimization and statistical inference 
problems. This paper presents a systematic framework for incor- 
porating such approximate message passing (AMP) methods in 
general graphical models. The key concept is a partition of depen- 
dencies of a general graphical model into strong and weak edges, 
with the weak edges representing interactions through aggregates 
of small, linearizable couplings of variables. AMP approximations 
based on the Central Limit Theorem can be readily applied to 
the weak edges and integrated with standard message passing 
updates on the strong edges. The resulting algorithm, which we 
call hybrid generalized approximate message passing (Hybrid- 
GAMP), can yield significantly simpler implementations of sum- 
product and max-sum loopy belief propagation. By varying the 
partition of strong and weak edges, a performance-complexity 
trade-off can be achieved. Group sparsity problems are studied as 
an example of this general methodology where there is a natural 
partition of edges. 

Index Terms — belief propagation, estimation, group sparsity, 
max-sum algorithm, maximum a posteriori probability, minimum 
mean-squared error, optimization, simultaneous sparsity, sum- 
product algorithm 

I. Introduction 

Message passing algorithms on graphical models have be- 
come widely-used in high-dimensional optimization and infer- 
ence problems in a range of fields |[T|, jSj. The fundamental 
principle of graphical models is to factor high-dimensional 
problems into sets of smaller problems of lower dimension. 
The factorization is represented via a graph where the problem 
variables and factors are represented by the graph vertices, 
and the dependencies between them represented by edges. 
Message passing methods such as loopy belief propagation 
(BP) use this graphical structure to perform approximate 
inference or optimization in an iterative manner. In each 
iteration, inference or optimization is performed "locally" on 
the sub-problems associated with each factor, and "messages" 
are passed between the variables and factors to account for 
the coupling between the local problems. 

Although effective in a range of problems, loopy BP is only 
as computationally simple as the problems in the constituent 
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factors. If the factors themselves are of high dimensions, 
exact implementation of loopy BP will be computationally 
intractable. 

To reduce the complexity of loopy BP, this paper presents 
a Hybrid- generalized approximate message passing (Hybrid- 
GAMP) algorithm for what we call graphical models with 
linear mixing. The basic idea is that when factors depend on 
large numbers of variables, the dependencies are often through 
aggregates of small, linearizable perturbations. In the proposed 
graphical model with linear mixing, these linear, weak interac- 
tions are identified by partitioning the graph edges into weak 
and strong edges, with the dependencies on the weak edges 
being described by a linear transform. Under the assumption 
that the components of the linear transform are small, it is 
argued that the computations for the messages of standard 
loopy BP along the weak edges can be significantly simplified. 
The approximate message passing along with the weak edges 
are integrated with the standard loopy BP messages on the 
strong edges. 

We illustrate this Hybrid-GAMP methodology in the context 
of two common variants of loopy BP: the sum-product algo- 
rithm for inference (e.g., computation of a posterior mean) and 
the max-sum algorithm for optimization (e.g., computation of a 
posterior mode). For the sum-product loopy BP algorithm, we 
show that the messages along the weak edges can be approx- 
imated as Gaussian random variables and the computations 
for these messages can be simplified via the Central Limit 
Theorem. For max-sum loopy BP, we argue that one can use 
quadratic approximations of the messages and perform the 
computations via a simple least-squares solution. 

These approximations can dramatically simplify the com- 
putations. The complexity of standard loopy BP generically 
grows exponentially with the maximum degree of the factor 
nodes. With the GAMP approximation, however, the com- 
plexity is exponential only in the maximum degree from the 
strong edges, while it is linear in the number of weak edges. 
As a result, Hybrid-GAMP algorithms on a graphical model 
with linear mixing can remain tractable even with very large 
numbers of weak, linearizable interactions. 

Gaussian and quadratic approximations for message passing 
algorithms with linear dependencies are, of course, not new. 
The purpose of this paper is to provide a systematic and 
general framework for these approximations that incorporate 
and extend many earlier algorithms. For example, many previ- 
ous works have considered Gaussian approximations of loopy 
BP for the problem of estimating vectors with independent 
components observed through noisy, linear measurements 1^- 
||9j. In the terminology of this paper, these algorithms apply to 
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graphs where all the non-trivial edges are weak. As discussed 



in Section [ni| by enabling graphs that have mixes of both 
strong and weak edges, the framework of this paper can 
significantly generalize these methods. For example, instead of 
the unknown vector simply having independent components, 
the presence of strong edges can enable the vector to have any 
prior describable with a graphical model. 

The approach here of combining approximate message 
passing methods and standard graphical model with linear 
mixing is closest to the methods developed in |T0|-|T3| 
for wavelet image denoising and turbo equalization. These 
works also considered graphical models that had both linear 
and nonlinear components, and applied approximate message 
passing techniques along the lines of |[7|, |[8j| to the lineariz- 
able portions while maintaining standard BP updates in the 
remainder of the graph. The use of approximate message 
passing methods on portions of a factor graph has also been 
applied with joint parameter estimation and decoding for 
CDMA multiuser detection in |T4| ; in a wireless interference 
coordination problem in flS], and proposed in fT6, Section 
7] in the context of compressed sensing. The framework 
presented here unifies and extends all of these examples 
and thus provides a systematic procedure for incorporating 
Gaussian approximations of message passing in a modular 
manner in general graphical models. 

II. Graphical Model Problems with Linear Mixing 
Let X and z be real-valued block column vectors 



• • • 7 ^n) 5 



(1) 



and consider a function of these vectors of the form 

m 

F(x,z) :=^/,(x,(,),zO, (2) 

where, for each i, fi{-) is a real-valued function; a{i) is a 
subset of the indices {1, . . . , n}; and XQ,(i) is the concatenation 
of the vectors {xj, j G a{i)}. We will be interested in 
computations on this function subject to linear constraints of 
the form 



(3) 



where each Aij is a real-valued matrix and A^ is the block 
column matrix with components A^j. We will also let A be 
the block matrix with components Aij so that we can write 
the linear constraints as z = Ax. 

The function F(x, z) is naturally described via a graphical 
model as shown in Fig. [T] Specifically, we associate with 
F(x, z) a bipartite factor graph G = (F, E) whose vertices 
V consist of n variable nodes corresponding to the (vector- 
valued) variables Xj , and m factor nodes corresponding to the 
factors fi{-) in ([2]). There is an edge (z, j) G E in the graph 
if and only if the variable Xj has some influence on the factor 
/^(xQ,(i), z^). This influence can occur in one of two mutually 
exclusive ways: 

• The index j is in a{i), so that the variable Xj directly 
appears in the sub-vector ^cx{i) in th^ factor fi{:>Cc,(i)^Zi). 



Strong edges ^1 

x-^3 



A 

Mixing 
matrix 



Weak edges 



Fig. 1. Factor graph representation of the linear mixing estimation and 
optimization problems. The variable nodes (circles) are connected to the factor 
nodes (squares) either directly (strong edges) or via the output of the linear 
mixing matrix A (weak edges). 



In this case, (i, j) will be called a strong edge, since Xj 
can have an arbitrary and potentially-large influence on 
the factor. 

• The matrix Aij is nonzero, so that Xj affects fi{yioc{i) i Zi) 
through its linear influence on z^ in (jSj. In this case, 

will be called a weak edge, since the approximations we 
will make in the algorithms below assume that Aij is 
small. The set of weak edges into the factor node i will 
be denoted 

Together a{i) and I3{i) comprise the set of all indices j 
such that the variable node Xj is connected to the factor node 
fi{-) in the graph G. The union d{i) = a{i) U j3{i), is thus 
the neighbor set of fi{-). Similarly, for any variable node Xj, 
we let a{j) be the set of indices i such that that the factor 
node fi{-) is connected to Xj via a strong edge, and let 
be the set of indices i such that there is a weak edge. We let 
d{j) = a{j) U be the union of these sets, which is the 
neighbor set of x^. 

Given these definitions, we are interested in two problems: 

• Optimization problem P-OPT: Given a function F(x, z) 
of the form ([2]) and a matrix A, compute the maxima: 

X = arg maxF(x, z), z = Ax. (4) 

X : z=Ax 

Also, for each j, compute the marginal value function 



Aj(xj) := max F{-k^z), 



x\ T : z= Ax 



(5) 



where the maximization is over all variables x^ for r j. 
• Expectation problem P-EXP: Given a function F(x, z) 
of the form ([2]), a matrix A, and scale factor u > 0, 
define the joint distribution 

p(x) := exp [i^F(x, z)] , z = Ax (6) 



Z(u) 



where Z{u) is a normalization constant called the parti- 
tion function (it is a function of u). For this distribution, 
compute the expectations 



E[x], 



E[z]. 



(7) 
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Also, for each j, compute the log marginal 

Aj(xj) := — log J exp [iiF(x, z)] dx^ 



(8) 



where the integral is over all variables x^ for r ^ j. 

Both problems P-OPT and P-EXP naturally arise in statisti- 
cal inference: Suppose we are given a probability distribution 
p(x) of the form ([6]) for some function F(x, z). The function 
F(x, z) may depend implicitly on some observed vector y, 
so that p(x) represents the posterior distribution of x given 
y. In this context, the solution (x,z) to the problem P-OPT 
is precisely the maximum a posteriori (MAP) estimate of x 
and z given the observations y. Similarly, the solution (x,z) 
to the problem P-EXP is precisely the minimum mean squared 
error (MMSE) estimate when u = 1. For P-EXP, the function 
Aj(xj) is the log marginal distribution of Xj. 

The two problems are related: A standard large deviations 
argument |17 | shows that, under suitable conditions, as ^ 
oo the distribution p(x) in ([6]) concentrates around the maxima 
(x,z) in the solution to the problem P-OPT. As a result, the 
solution (x,z) to P-EXP converges to the solution to P-OPT. 

A. Further Assumptions and Notation 

In the analysis below, we will assume that, for all factor 
nodes fi{-), the strong and weak neighbors, a{i) and 
are disjoint. That is, 

a(i)n/3(i) =0. (9) 

This assumption introduces no loss of generality: If an edge 
is both weak and strong, we can modify the function 
/i(xQ,(z), z^) to "move" the influence of Xj from the term z^ 
into the direct term XQ,(i). For example, suppose that for some 

7.i = A^iXi + AisXs + Ai4X4 

and a{i) = {1,2}. In this case, the edge (z, 1) is both strong 
and weak. That is, the function /^(xQ,(i), z^) depends on the 
variable xi both directly through Xa{i) ^^d through z^. To 
satisfy the assumption ([9]), we define a new z^ 



AisXs + Ai4X4, 



and new function /^(•) 



(x«(,),zr-) = /r-((xi,x2),zr-) 

= /,((xi,X2),AaXi+zr"). 
With these definitions, 

/,(x«(,),z,) = /r-(x«(,),zr-). 

Therefore we can replace /^(•) and z^ with /"^^(•) and z^®^ 
and obtain an equivalent problem. 

Even when the dependence of a factor /^(xQ,(i), z^) on a 
variable Xj is only through the linear term z^, we may still 
wish to "move" the dependence to a strong edge. The reason 
is that the GAMP algorithm below assumes that the linear 
dependence is weak, that is, Aij is small. If that is not the 
case, the dependence can be treated as strong, where the small- 
term approximations are not made. This improves the accuracy 
of the method at the expense of greater computation. 
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Fig. 2. An example of a simple graphical model for an estimation problem 
where x has independent components with priors p(xj), z = Ax, and the 
observation vector y is the output of a componentwise measurement channel 
with transition function p(yi | z^). 



One final piece of notation: since Aij ^ only when j G 
I3{i), we may sometimes write the summation ^ as 



(10) 



where x^(^) is the sub- vector of x with components j G 
and A^^(^) is the corresponding portion of the zth block-row 
of A. 

III. Motivating Examples 

We begin with a basic development to show that some 
previously-studied problems with a fully separable prior on 
X fit within our model. Then we show an extension to more 
complicated problems. A more extensive example is deferred 
to Section [Vll 

Linear Mixing and General Output Channel — Independent 
Variables: As a simple example of a graphical model with 
linear mixing, consider the following estimation problem: An 
unknown vector x has independent components x ^ , each with 
probability distribution p(xj). The vector x is passed through 
a linear transform to yield an output z = Ax. Each component 

of z then randomly generates an output component with 
a conditional distribution p{yi | z^). The problem is to estimate 
X given the observations y and transform matrix A. 

Under the assumption that the components Xj are indepen- 
dent, and the components y^ are conditionally independent 
given z, the posterior distribution of x factors as 



p(x I y) 



1 



Wp{yi I Zi) ]^p(x^-). z = Ax, 



where Z{y) is a normalization constant. If we now fix some 
observed y, we can rewrite this posterior as 



p(x I y) (X exp [F(x, z)] , 
where F(x, z) is the log posterior. 



Ax, 



^(x, z) = ^ logp(yi I Zi) + ^ logp(x^), 

and the dependence on y is implicit. The log posterior is 
therefore in the form of ^ with the scale factor u = 1 and 
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m-\-n factors, fi{-), i = 1, . . 
be used for the output terms 



,m-\-n. The first m factors can 



fi{zi) = \ogp{yi\zi), 



which do not depend on any of the terms Xj. That is, = 
for each variable index j. The remaining n factors are for the 
inputs 



j = l,...,n. 



For these factors, the strong edge set is the singleton a(m + 
j) = {j}, and there is no linear term; we can think of Zm-\-j 
as zero-dimensional. The corresponding factor graph with the 
m-\-n factors is shown in Fig. |2] 

In the case when all the variables Xj and are scalars, the 
estimation problem is precisely the set-up considered in |6J, 
|T8| , as mentioned in the introduction. The special subcase 
of this problem when each output measurement yi is Zi with 
additive white Gaussian noise (AWGN), 



Zi + Wi 



(11) 



is considered in several references |[3|, |[5|, |[8|, (91, | [T9| for a 
variety of different distributions p{xj). 

Linear Mixing and General Output Channel — Dependent 
Variables: The graphical model framework considered here is 
significantly more general. For example, consider the graphical 
model in Fig. |3] In this case, the components of the input 
vector are no longer necessarily independent, but instead 
described themselves by a graphical model. Some additional 
latent variables in the vector u may also be added. For 
example, pO| used a discrete Markov chain to model clustered 
sparsity, (UJ used discrete-Markov and Gauss-Markov chains 
to model slow changes in support and amplitude across multi- 
ple measurement vectors, and |12| used a discrete Markov 
tree model to capture the persistence across scales in the 
wavelet coefficients of an image. In the example we present in 



Section |Vlj we will use the graphical model to represent joint 
sparsity. Similarly, the output need not involve a separable 
mapping, and the observations y can depend on the outputs 
z through a second graphical model, also with some latent 
variables v that can act as unknown parameters. For example, 
the output could be an unknown nonlinear function and the 
parameters v may be the parameters in the function. This 
technique was used in 1 13 1 to incorporate constraints on LDPC 
coded bits when performing turbo sparse-channel estimation, 
equalization, and decoding using GAMP. 

IV. Review of Loopy Belief Propagation 

Finding exact solutions to the problems P-OPT and P-EXP 
above is generally intractable for most factors fi{-), as 
the solutions requires optimizations or expectations over all 
n variables Xj. A widely-used approximate technique is 
loopy belief propagation |?|, f20l, which attempts to re- 
duce the optimization and estimation problems to a sequence 
of lower-dimensional problems associated with each factor 
/i(xQ,(i), z^). We consider two common variants of loopy BP: 
the max-sum algorithm for the problem P-OPT and the sum- 
product algorithm for the problem P-EXP. This section will 
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Fig. 3. A generalization of the model in Fig. [2] where the input variables 
X are themselves generated by a graphical model with latent variables u. 
Similarly, the dependence of the observation vector y on the linear mixing 
output z is through a second graphical model. 



briefly review these methods, as they will be the basis of the 
GAMP algorithms described in Section |V] 

The max-sum loopy BP algorithm is based on iteratively 
passing estimates of the marginal utilities Aj(xj) in ([5]) 
along the graph edges. Similarly, the sum-product loopy BP 
algorithm passes estimates of the log marginals Aj(xj) in ([8]). 
For either algorithm, we index the iterations by t = 0, 1, . . ., 
and denote the estimate "message" from the factor node fi to 
the variable node Xj in the tth iteration by A^^j(t,Xj) and 
the reverse message by A^^j(t,Xj). 

To describe the updates of the messages, we need to intro- 
duce some additional notation. First, the messages in loopy 
BP are equivalent up to a constant factor. That is, adding any 
constant term that does not depend on Xj to either the message 
Ai^j (t, Xj) or Ai^j (t, Xj) has no effect on the algorithm. We 
will thus use the notation 

A(x) ^ 5(x) ^ A(x) = 5(x) + C, 

for some constant C that does not depend on x. Similarly, we 
write p(x) oc g(x) when p(x) = Cq{^) for some constant C. 
Finally, for the sum-product algorithm, we will fix the scale 
factor > in the problem P-EXP, and, for any function 
A(-), we will write E[^(x) ; A(-)] to denote the expectation 
of ^(x) with respect to a distribution specified indirectly by 
A(.): 



E[5(x); A(-)] = j g{x)p{^)dx, 
where p(x) is the probabihty distribution 



(12) 



p(x) = 



Z{u 



■ exp ['uA(x)] 



and Z{u) is a normalization constant. Given these definitions, 
the updates for the messages in both the max-sum and sum- 
product loopy BP algorithms are as follows: 

Algorithm 1: Loopy belief propagation: Consider the 
problems P-OPT or P-EXP above for some function F(x, z) 
of the form ([2]) and matrix A. For the problem P-EXP, fix 
the scale factor u> {). The max-sum loopy BP algorithm for 
the problem P-OPT and the sum-product loopy BP for the 
problem P-EXP follow the following steps: 
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1) Initialization: Set t = and, for all G E, set 
Ai^j{t,yij) = 0. 

2) Factor node update: For all edges (z, j) G compute 
the function 

•= /i(xc(i),Zi) + ^ Ai^^(t,x^). (13) 

For max-sum loopy BP compute: 

Ai^j{t,:sij) = max i^i^_^-(t,X5»(i),Zi), (14) 

^d(i)\j 
Zi—AiX 

where and the maximization in ([14]) is over all variables 
with r e d{i) and r j, subject to the constraint 
Zi = A^x. 

For sum-product loopy BP compute: 

Ai^j{t,:>Cj) = ^\og J pi^j{t,:>^Q^i))dyiQ^i)\j, (15) 

where the integration is over variables x^ with r G d{i) 
and r ^ j, and pi^j{t^:siQ(^i^) is the probability distri- 
bution 

Pi^j{t,:>CQ^i^) (X exp [uHi^j{t,:>CQ^i^,Zi)] , 

Zi = A^x. (16) 

3) Variable node update: For all G E\ 
Also, let 

A,(t+l,x,)= ^ A,^,(t,x,). (18) 

For max-sum loopy BP, compute: 

x^(t+l) := argmaxA^(t + l,x^). (19) 

For sum-product loopy BP, compute: 

x^-(t+l) :=E[x^-; A^-(t+l,-)]. (20) 

Increment t and return to step 2 for some number of 
iterations. 

When the graph G is acyclic, then it can be shown that 
the max-sum and sum-product loopy BP algorithms above 
converge, respectively, to the exact solutions to the problems 
P-OPT and P-EXP in Section [ll| However, for graphs with 
cycles, loopy BP algorithm is, in general, only approximate. A 
complete analysis of loopy BP algorithm is beyond the scope 
of this work and is covered extensively elsewhere. See, for 
example, ||2|, (201 and li21J. 

What is important here is the computational complexity 
of loopy BP. Brute force solutions to the problems P-OPT 
and P-EXP in Section |ll| involve either a joint optimization or 
expectation over all n variables Xj. The loopy BP algorithm 
in contrast reduces these "global" problems to a sequence of 
"local" problems associated with each of the factors fi{-). 
The local optimization or expectation problems may be signif- 
icantly lower in dimension than the global problem. Consider, 



for example, the update in Step 2 at some factor node fi{-) and 
let d = \d{i)\ be the number of neighbors of fi{-). The factor 
fi{^a{i) 7 Zi) is thus a function of d variables x^, either through 
one of the strong edges or weak edges. For each 

j G d{i), the optimization ([14]) for the max-sum algorithm thus 
involves an optimization over d — 1 of the variables. Similarly, 
for the sum-product algorithm, the factor node update ( p3] ) 
involves an integration over d—1 variables. For certain classes 
of functions (such as the binary constraint functions in LDPC 
codes p2| , (23} , the optimization or expectation admits a 
simple solution. However, both optimization and expectations, 
in general, have complexities that grow exponentially in d. 
Thus, standard loopy BP is only typically tractable when the 
degrees d of the factor nodes are small or the factors have 
some particular form. 

V. Hybrid-gamp 

The Hybrid-GAMP algorithm reduces the cost of loopy 
BP by exploiting complexity-reducing approximations of the 
cumulative effect of the weak edges. We saw in the previous 
section that the loopy BP update at each factor node /i(-) 
has a cost that may be exponential in the degree d of the 
node, which consists of \a{i)\ strong edges and |/3(z)| weak 
edges. The Hybrid-GAMP algorithm with edge partitioning 
uses the linear mixing property to eliminate the exponential 
dependence on the weak edges, so the only exponential 

dependence is on the strong edges. Thus, the edge 

partitioning makes Hybrid-GAMP computationally tractable 
as long as the number of strong edges is small. There can be 
an arbitrary number of weak edges. In particular, the mixing 
matrix A can be dense. 

The basis of the Hybrid-GAMP approximation is to assume 
that the matrix A^j is small along any weak edge (i, j). Under 
this assumption, in the max-sum algorithm one can apply a 
quadratic approximation of the messages along the weak edges 
and reduce the factor node update to a standard least-squares 
problem. Similarly, in the sum-product algorithm, one can 
apply a Gaussian approximation of the weak edges messages 
and use the Central Limit Theorem at the factor nodes. 

A heuristic derivation of the Hybrid-GAMP approximations 
is given in Appendix |A| for the max-sum algorithm and 
Appendix |B] for the sum-product algorithm. We emphasize that 
these derivations are merely heuristic — we do not claim any 
formal matching between loopy BP and the Hybrid-GAMP 
approximation. 

To state the Hybrid-GAMP algorithm, we need additional 
notation: The Hybrid-GAMP algorithm produces a sequence 
of estimates Xj (t) and z^(t) for the variables Xj and z^. Several 
other intermediate variables Pi(t), Si{t) and Vj{t) are also 
produced. Associated with each of the variables are matrices 
Qj (^), Qf (t), . . ., that represent certain Hessians for the max- 
sum algorithm and covariances for the sum-product algorithm. 
When we need to take the inverse of the matrices, we will 
use the notation Qj'^(t) to mean (QJ(t))"^ Also, a* and 
A* denote the transposes of vector a and matrix A. Finally, 
for any positive definite matrix Q and vector a, we will let 
llallq = a*Q~^a, which is a weighted two norm. 
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Algorithm 2: Hybrid-GAMP: Consider the problems 
P-OPT or P-EXP above for some function F(x, z) of the 
form ^ and matrix A. For the problem P-EXP, fix the 
scale factor u > 0. The max-sum GAMP algorithm for 
the problem P-OPT and the sum-product GAMP for the 
problem P-EXP follow the following steps: 

1) Initialization: Set t = and select some initial values 
A^^j(t — l,Xj) for all strong edges and values 
rj{t — l) and QJ(t — 1) for all variable node indices j. 

2) Variable node update, strong edges: For all strong edges 
(i, j), compute 

3) Variable node update, weak edges: For all variable nodes 
j, compute 

A,(t,x,.) = H^{t,^,,Vj(t-l),Cfj{t-l)) (22) 



(21) 



and 



= E Jllfi-^illq- (23) 



For max-sum GAMP: 



Xj (t) = arg max A j (t, Xj ) , (24a) 

Q-^(t) = -|^A,(t,x,). (24b) 

For sum-product GAMP: 

%(t) = E(x,; A,(t,-)), (25a) 

QJ(t) = ^1 var (x^- ; A^-(t, •)) • (25b) 

4) Factor node update, linear step: For all factor nodes 
compute 





= E 




(26a) 




ie/3(i) 








= Mt)- 


-QP(t)ii(t-i), 


(26b) 




= E 


AijQ^- (t)A^^-, 


(26c) 











where, initially, we set Si( — 1) = 0. 
5) Factor node update, strong edges: For all strong edges 
(z, j), compute: 

+ ^ Ai^^(t,X^) - ^||Zi - Pillgp. (27) 

Then, for max-sum GAMP compute: 

= max iff_^/t,x„(i),Zi,pi(t),QP(t)), (28) 

where the maximization is over and all components 
x^ with r G a{i) \ j. 



For sum-product GAMP compute: 

Ai^j{t,yij) = ;^log y Pi^j{t,yiaii)^Zi)dyiaii)\jdzi 

(29) 

where the integral is over and all components x^ 
with r G a{i) \j, and p^^j(0,Xj) is the probability 
distribution function 

Pi^j(^,Xc,(i),Zi) (X 

exp (iii^f_^^-(t,Xc,(i).z^,pi(t),Qf(t))) . (30) 

6) Factor node update, weak edges: For all factor nodes i, 
compute 

(^,Xc,(i),Zi,pi,Q[) := /^(xc,(i),z^) 
+ A^^^(t,x^)-i||z^-p^||^p. (31a) 

rea{i) 

Then, for max-sum GAMP compute: 

(x°(,)(t),Z°(t)) 

:= argmaxiJf (t,x„(i),Zi,pi(t),QP(i)), (31b) 
Df(t) := -^i?f(t,xO(,),z?,p,(t),Q^(t)), (31c) 



where the maximization in ( [31b| ) is over the sub-vector 
Xc^(^) and output vector z^. 
For sum-product GAMP, let 

z^(t) = E(z,), Q?(t) = iivar(zO, (32) 

where z^ is the component of the pair (xQ;(i),Zi) with 
the joint distribution 

Pi(t,Xc,(i),Zi) (X 

exp(^ii7f(t,x,(,),z„p,(t),Q[(t))) . (33) 

Then, for either max-sum or sum-product compute 

Mt) = Q-P(i)[z?(t)-pi(i)], (34a) 
QUt) = Q-^(t)-Q-^(i)D-^(i)Q-P(t).(34b) 

7) Variable node update, linear step: For all variables nodes 
j compute 

Q-^(t) = Yl A*,.QKt)A,„ (35a) 
r,it) = x(t) + QJ(t) Y A*,s,(t). (35b) 

ie/3(j) 

Increment t and return to step 2 for some number of 
iterations. 

Although the Hybrid-GAMP algorithm above appears much 
more complicated than standard loopy BP (Algorithm [T]), 
Hybrid-GAMP can be computationally dramatically cheaper. 
Recall that the main computational difficulty of loopy BP is 
Step 2, the factor update. The updates ([14]) and ([T5j involve an 
optimization or expectation over \d{i)\ variables, where d{i) 
is the set of all variables connected to the factor node i. In the 
Hybrid-GAMP algorithm, these computations are replaced by 
([28]) and ([29]), where the optimization and expectation need 
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only be computed over the strong edge variables a{i). If the 
number of edges is large, the computational savings can be 
dramatic. The other steps of the Hybrid-GAMP algorithms are 
all linear, simple least- square operations, or componentwise 
nonlinear functions on the individual variables. 

A. Variants 

For illustration, we have only presented one form of the 
Hybrid-GAMP procedure. Several variants are possible: 

• Discrete distributions: The above description assumed 
continuous valued random variables xj. The procedures 
can be easily modified for discrete-valued variables by 
appropriately replacing integrals with summations. 

• Message scheduling: The above description also only 
considered a completely parallel implementation where 
each iteration performs exactly one update on all edges. 
Other so-called message schedules are also possible and 
may offer more efficient implementations or better con- 
vergence depending on the application p4| , pS] . 

VI. Application to Structured Sparsity 
A. Hybrid-GAMP Algorithm 

To illustrate the Hybrid-GAMP method, we consider the 
group sparse estimation problem (2E\, (TT\. Although this 
problem does not utilize the full generality of the Hybrid- 
GAMP framework, it provides a simple example of the 
Hybrid-GAMP method and has a number of existing algo- 
rithms that can be compared against. 

A general version of the group sparsity problem that falls 
within the Hybrid-GAMP framework can be described as 
follows: Let x be an n-dimensional vector with scalar com- 
ponents Xj, j = l,...,n. Vector- valued components can 
also be considered, but we restrict our attention to scalar 
components for simplicity. The component indices j of the 
vector X are divided into K (possibly overlapping) groups, 
Gi, . . . , Gk ^ {I7 • • • 7 We let 7(j) be the set of group 
indices k such that j G Gk- That is, the 7(j) is the set of 
groups for which the component xj belongs to. 

Suppose that each group Gk can be "active" or "inactive", 
and each component Xj can be non-zero only when at least one 
group Gk is active for some k G 7(j). Qualitatively, a vector x 
is sparse with respect to this group structure if it is consistent 
with only a small number of groups being active. That is, most 
of the components of x are zero with the non-zero components 
having support contained in a union of a small number of 
groups. The group sparse estimation problem is to estimate 
the vector x from some measurements y. The traditional (non- 
group) sparse estimation problem corresponds to the special 
case when there n groups of singletons Gj = {j}. 

Particularly with overlapping groups, there are a number 
of ways to model the group sparse structure in a Bayesian 
manner. For illustration, we consider the following simple 
model: For each group Gk, let ^k ^ {0, 1} be a Boolean 
variable with = 1 when the group G/c is active and = 
when it is inactive. We call ^k activity indicators and model 
them as i.i.d. with 

P{^k = 1) = 1 - P(a = 0) = p (36) 



P&) P(Xj|^y(j)) Xj 




Fig. 4. Graphical model for the group sparsity problem with overlapping 
groups. The group dependencies between components of the vector x are 
modeled via a set of binary latent variables ^. 



for some sparsity level p G (0,1). We assume that given 
the vector ^, the components of x are independent with the 
conditional distributions 

r if a=0forall/cG7(j) .37. 
^ \ V otherwise, 

where F is a random variable having the distribution of 
the component Xj in the event that it belongs to an active 
group. Finally, suppose that measurement vector y is generated 
by first passing x through a linear transform z = Ax, 
and then a separable componentwise measurement channel 
with probability distribution functions p{yi\zi). Many other 
dependencies on the activities of x and measurement models 
y are possible - we use this simple model for illustration. 

Under this model, the prior x and the measurements y 
are naturally described by a graphical model with linear 
mixing. Due to the independence assumptions, the posterior 
distribution of x given y factors as 

^ m n K 

p(x|y) = ^ X{p{yi\zi) n n^M-^u)) n ^(^'»)' (38) 

where P(xj|^^(j)) is the conditional distribution for the ran- 
dom variable in ([37]). The factor graph corresponding to this 
distribution is shown in Fig. [4] 

Under this graphical model. Appendix [C] shows that the 
sum-product version of Hybrid-GAMP algorithm in Algorithm 
|2] reduces to the simple procedure in Algorithm [3] A similar 
max- sum algorithm could also be derived. In lines [8] and [9j we 
have used the notation E{X\R; Q^, p) and \ar{X\R; Q^, p) to 
denote the expectation and variance of the scalar variable X 
with distribution 

^ f with probability I — p 
\ V with probability p; 

and R is an AWGN corrupted version of X 

R = X^W, iy-A/'(0,Q^). (40) 

The algorithm can be interpreted as the GAMP procedure 
in (TSj run in a parallel with updates the sparsity levels. 
Specifically, each iteration t of the main repeat-until loop has 
two stages. The first half of the iteration, labeled as the "basic 
GAMP update", is identical to the standard updates from the 
basic GAMP algorithm |18|, treating the components Xj as 
independent with sparsity level Pj{t). The second half of the 
iteration, labeled as the "sparsity update", updates the sparsity 
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Algorithm 3 Sum-product Hybrid-GAMP for group sparsity 



1: {Initialization} 



^ 



LLR,^fe(t-l)^log(p/(l 
repeat 



n 



1/(1 



-p)) 

exp LLRj^/e(t- 



■1)) 



f expLLRi^fe(t-l)) 



7: {Basic GAMP update} 

Xj{t) ^E{X\R = rj{t-l),^j, 

Q!it)^E,\Aij\'Q]{t) 

Pi{t) ^Zi{t)-Qf{t)si{t-1) 
Qf{t)^yar{z,\pi{t),QPi{t)) 

Qm^Q7'im-Qnt)/Q'iit)) 
Qr{t)^Ei\Aj\'Qm 

19: {Sparsity level update} 

21: Compute LLRj^k{t) from ([41]) 

22: LLRj^kit) ^ log(p/(l - p)) 4 
23: ^ 1 - Ukejij) 1/(1 + exp LLR,-^fe(t)) 

24: t^t^l 

25: until Terminate 



levels pj {t) based on the estimates from basic GAMP half of 
the iteration. 

The sparsity update half of the iteration in Algorithm |3] 
also has a simple interpretation. The quantities Pj{t) and 
Pj^k{t) can be interpreted, respectively, as estimates for the 
probabilities 

Pj = Pr(^/e = 1 for some k G \ y) 
Pj^k = Pr = 1 for some i G ^k\y). 

That is, Pj{t) is an estimate for the probability that the 
component xj belongs to at least one active group and pj^k{t) 
is the estimate for the probability that it belongs to an active 
group other than Gk- Similarly, the quantities LLRj^k{t) and 
LLRj^k{t) are estimates for the log likelihood ratios 



LLR, 



log 



P[^k = 0\y) 

Most of the updates in the sparsity update half of the iteration 
are the natural conversions from the LLR values to estimates 
for Pj and pj^k- In line 21 the LLR message is computed 
by 



LLR,^fc(t) = log 



PR{rj(t);Q-^{t),p=l) 

PR ijj (t) ]Q)(t),p = pj^k{t) ) 



(41) 



where pR{r; Q^, p) is the probability distribution for the scalar 
random variable R in ( |40| ) where X has the distribution ([39]). 
The message ( |4T] ) is the ratio of the likelihood of the output 
r given that Xj definitely belongs to an active group to the 



likelihood given that Xj belongs to an active group other than 
the group Gk- 

In this way, the Hybrid-GAMP procedure provides an 
intuitive and simple method for extending the basic GAMP 
algorithm of |18 | for structured sparsity. 

The Hybrid-GAMP algorithm for group sparsity is also 
extremely general. The algorithm can apply to arbitrary pri- 
ors and output channels. In particular, the algorithm can 
incorporate logistic outputs that are often used for group 
sparse classification problems |[28|-p0|. Also, the method can 
handle arbitrary, even non-overlapping, groups. In contrast, the 
extensions of other iterative algorithms to the case of non- 
overlapping groups sometimes requires approximations. See, 
for example, |31|. In fact, the methodology is quite general 
and likely be applied to general structured sparsity, including 
possibly the graphical model based sparse structures in image 
processing considered in |[32|. 



B. Computational Complexity 

In addition to its generality, the Hybrid-GAMP procedure 
is among the most computationally efficient. To illustrate 
this point, consider the special case when there are K non- 
overlapping groups of d elements each. In this case, the total 
vector dimension for x is n = Kd. We consider the non- 
overlapping case since there are many algorithms that apply 
to this case that we can compare against. For non-overlapping 
uniform groups. Table |l] compares the computational cost of 
the Hybrid-GAMP algorithm to other methods. 

The dominant computation in each iteration of the Hybrid- 
GAMP algorithm. Algorithm |3] is simply the matrix multi- 
plications by A and A* and their componentwise squares of 
A and A*. These operations all have total cost 0{mn) = 
0{mdK). Note that these multiplications could be cheaper if 
the matrix has any particular structure (e.g. sparse, Fourier, 
etc). The other per iteration computations are the m scalar 
estimates at the output (lines [13] and [14]); the n scalar estimates 
at the input (lines [8] and [9]) and the updates of the LLRs. 
All these computations are less than 0{mn) for the matrix 
multiply. 

For the case of non-overlapping groups, the Hybrid-GAMP 
algorithm could also be implemented using vector-valued 
components. Specifically, the vector x can be regarded as a 
block vector with K vector components, each of dimension 
d. The general Hybrid-GAMP algorithm. Algorithm [2] can be 
applied on the vector- valued components. To contrast this with 
Algorithm [3] we will call Algorithm [3] Hybrid-GAMP with 
scalar components, and call the vector- valued case Hybrid- 
GAMP with vector components. 

The cost is slightly higher for Hybrid-GAMP with vector 
components. In this case, there are no non-trivial strong edges 
since the block components are independent. However, in 
the update ( |26c| ), each Aij is 1 x d and Qj{t) is d x d. 
Thus, the computation ( [26c| ) requires mK computations of 
d'^ cost each for a total cost of 0{mKd^) = 0{mnd), 
which is the dominant cost. Of course, there may be a benefit 
in performance for Hybrid-GAMP with vector components, 
since it maintains the complete correlation matrix of all the 



9 



Method 




Group-OMP | 34 | 


0{pmn^) 


Group-Lasso 126], |27|, 135] 


0(mn) per iteration 


Relaxed BP with vector components (33) 


0{mn^) per iteration 


Hybrid-GAMP with vector components 


0{mnd) per iteration 


Hybrid-GAMP with scalar components 


0(mn) per iteration 



TABLE I 

Complexity comparison for different algorithms for group 
sparsity estimation of a sparse vector with k groups, each 
group of dimension d. the number of measurements is 171 and 
the sparsity ratio is p. 



components in each group. We do not investigate this possible 
performance benefit in this paper. 

Also shown in Table U is the cost of a relaxed BP method 
of p3| , which also uses approximate message passing simi- 
lar to Hybrid-GAMP with vector components. That method, 
however, performs the same computations as Hybrid-GAMP 
on each of the mK graph edges as opposed to the m + K 
graph vertices. It can be verified that the resulting cost has an 
0{mK'^(P) = 0{mn^) term. 

The cost is slightly higher for GAMP with vector com- 
ponents. In this case, there are no non-trivial strong edges 
since the block components are independent. However, in the 
update ( |26c| ), each A^j is 1 x d and {t) is dx d. Thus, the 
computation ( |26c| ) requires mn computations of d'^ cost each 
for a total cost of 0{mnd^), which is the dominant cost. 

These message passing algorithms can be compared against 
widely-used group LASSO methods p6| , | [27| which estimate 
X by solving some variant of a regularized least- squares 
problem of the form 

1 

S := argmin -||y - A^f + 7^ ||x,||2, (42) 

for some regularization parameter 7 > 0. The problem ( [42| ) is 
convex and can be solved via a number of methods including 
p5j|-|37||, the fastest of which is the SpaRSA algorithm of 
| |35| . Interestingly, this algorithm is similar to the GAMP 
method in that the algorithm is an iterative procedure, where 
in each iteration there is a linear update followed by a 
componentwise scalar minimization. Like the GAMP method, 
the bulk of the cost is the 0{mn) operations per iteration for 
the linear transform. An alternative approach for group sparse 
estimation is group orthogonal matching pursuit (Group-OMP) 
of pm , p4| , a greedy algorithm that detects one group at 
a time. Each round of detection requires K correlations of 
cost md'^. If there are on average pK nonzero groups, the 
total complexity will be 0{pK'^md^) = 0{pmii?). From the 
complexity estimates summarized in Table |l| it can be seen that 
GAMP, despite its generality, is computationally as simple (per 
iteration) as some of the most efficient algorithms specifically 
designed for the group sparsity problem. 

Of course, a complete comparison requires that we consider 
the number of iterations, not just the computation per iteration. 
This comparison requires further study beyond the scope of 
this paper. However, it is possible that the Hybrid-GAMP 
procedure will be favorable in this regard. Our simulations 




Num measurements (m) 

Fig. 5. Comparison of performances of various estimation algorithms for 
group sparsity with n = 100 groups of dimension d = A with a sparsity 
fraction of p = 0.1. 

below show good convergence after only 10-20 iterations. 
Moreover, in the case of independent (i.e. non-group) sparsity, 
the number of iterations for AMP algorithms is typically 
small and often much less than other iterative methods. For 
examples, the paper |16| shows excellent convergence in 10- 
20 iterations which is dramatically faster than iterative soft 
thresholding method of p8| . 

C. Numerical Simulation 

Fig. |5] shows a simple simulation comparison of the mean 
squared error (MSE) of the Hybrid-GAMP method (Algorithm 
|3]) along with group OMP, group LASSO and a simple linear 
minimum MSE estimator. The simulation used a vector x 
with n = 100 groups of size d = 4 and sparsity fraction of 
p = 0.1. The matrix was i.i.d. Gaussian and the observations 
were with AWGN noise at an SNR of 20 dB. The number 
of measurements m was varied from 50 to 200, and the plot 
shows the MSE for each of the methods. The Hybrid-GAMP 
method was run with 20 iterations. In group LASSO, at each 
value of m, the algorithm was simulated with several values of 
the regularization parameter 7 in ( |42| ) and the plot shows the 
minimum MSE. In Group-OMP, the algorithm was run with 
the true value of the number of nonzero coefficients. It can be 
seen that the GAMP method is consistently as good or better 
than both other methods. All code for the simulations can be 
found in the SourceForge open website |39|. 

We conclude that, even in the special case of group sparsity 
with AWGN measurements, the GAMP method is at least 
comparable in performance and computational complexity to 
the most competitive algorithms. On top of this, GAMP offers 
a much more general framework that can include more rich 
modeling in both the output and input. 

VII. Conclusions 

A general model for optimization and statistical inference 
based on graphical models with linear mixing was presented. 
The linear mixing components of the graphical model account 
for interactions through aggregates of large numbers of small. 
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linearizable perturbations. Gaussian and second-order approx- 
imations are shown to greatly simplify the implementation 
of loopy BP for these interactions, and the Hybrid-GAMP 
framework presented here enables these approximations to 
be incorporated in a systematic manner in general graphical 
models. Simulations were presented for group sparsity where 
the Hybrid-GAMP method has equal or superior performance 
to existing methods. However, the generality of the method 
will enable GAMP to be applied to much more complex 
models where few algorithms are available. In addition to 
experimenting with such models, future work will focus on 
establishing rigorous theoretical analyses along the lines of 

Appendix A 

Heuristic Derivation of Max-Sum Hybrid-GAMP 

A. Preliminary Lemma 

Before deriving the Hybrid-GAMP approximation for the 
max- sum algorithm, we need the following result. Let 
i^(w,v) be a real- valued function of vectors w and v of 
the form 

1„ 



which implies that 



i^(w, v) Ho{w) - 2 11^ 



for some positive definite matrix Q^. For each v, let 
w(v) := arg maxi^(w, v), 

w 

G(v) := i^(w(v), v) = maxi7(w, v). 



(43) 

(44a) 
(44b) 



Lemma 1: Assume the maximization in ( |44| ) exists and is 
unique and twice differentiable. Then, 

^ ^ Q-"(w(v)-v), 



where 



D 



= -D-iQ", 

= -Q-"-Q-"D-iQ-", 

a2ij(w,v) 



(45a) 
(45b) 
(45c) 



w=w(v) 

Proof: Since w = w(v) is a maximizer of H{w, v) 



dw 

Therefore, ( |45a| l follows from 
dG{v) 5iJ(w(v),v) 



= 0. 



(46) 



9v 



dw 



c^w(v) I c^if(w(v),v) 

aij(w(v),v) 



Q-"(w(v)-v), 



where the last step is a result of the form of i^(-) in ( [43] ). The 
form of H{-) in ( [43] ) also shows that for all w and v 

dwdv 

Taking the derivative of ( |46| ), 

d'^H{w, v) S^i^ (w, v) 5w(v) 



dwdv 



dv 



0, 



9w(v) 
dv 



-D-lQ-^ 



which proves ( |45b| ). Finally, taking the second derivative of 
( |45a| ) along with ( [45b| ) shows ( |45c| ). ■ 



B. Hybrid-GAMP Approximation for Max-Sum 

First consider the factor node update ([14]) and partition the 
objective function Hi^j{-) in ([13]) as 



'^«(^)' + (i, (47) 



where 



•= /i(xa(i),Zi) + ^ Ai^rit.^r), (48a) 

HJZf{t^^m)-= ^^^r{t^''r)• (48b) 

That is, we have separated the terms in the objective function 
Hi^j{-) between the strong and weak edges. We can also 
partition the maximization ([14]) as 

= max [Af_;7^(t, x„ z,) + A^f (t, x„ z,)] , (49) 

where 

AZj^t,Xj,Zi) := max i/Hr(t,x«(,),Zi), (50a) 
max fff^f (t,x^(,)), (50b) 



Zi=AjX 



with the maximization in ( |5Qa| ) being over all x^ with r G 
a{i) \ j; and the maximization in ( |5Qb| ) over all x^ with r G 
\ j subject to Zi = A^x. The partitioning ( |49l ) is valid 
since the strong and weak edges are distinct. This insures that 
for all r G S{i), either r G or r G /3(i), but not both. 

The Hybrid-GAMP approximation applies to the weak term 
( |50b| ). For any j and all weak edges (^, j), define: 



X j {t) : = arg max A j (t , x j ) , 
%^j{t) := argmax Ai^^(t,x^), 



52 



Sx? 



_2 Aj(t, Xj 
2 



(51a) 
(51b) 

(51c) 

, (5 Id) 



which are the maximum and Hessian of the incoming weak 
messages. Since the assumption of the Hybrid-GAMP algo- 
rithm is that A^^ is small for all weak edges (i, r), the values 
of x^ in the maximization ( |5Qb| ) will be close to x^^^(t). 
So, for all weak edges, (i,r), we can approximate each term 
Ai^r{t^^r) in ( |48b| ) with the second-order approximation 



Ai^r (^7 -^r ) 



1, 



Ai^r{tjXi^r{t)) — -||Xr — X^^^ || 



(52) 
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where we have additionally made the approximation 
Qf^r(^) ^ Qrit) for all i. Substituting ([52]) into ( |48b| ), the 
maximization dSObl) reduces to 



Now let 



Ar_r/(^,x,,z,)^ const 



max 



, (53) 



where the constant term does not depend on Xj or z^. 

To proceed, we need to consider two cases separately: when 
j G and when j ^{i). First consider the case when 
j G I3{i). That is, (z, j) is a weak edge. In this case, a standard 
least squares calculation shows that ([53]) reduces to 



Ar-/(t,x,,z,)^ const 



where 



rG/3(i)/j 



(54) 

(55a) 
(55b) 



x^, so we can write 



Also, when j G f3{i), the assumption that a{i) and are 
disjoint implies that j In this case, A-^°"^(t, Xj, z^) 
in ( |5Qa| ) with the objective function ( |48a| ) will not depend on 



, (56) 



A^!;7^(t,x„z,)=Af-^(t,z,) 



:= max 



rea{i) 



where the maximization is over all x^ for r G a{i). Combining 
( [49] ), ( [54] ) and ( [56] ), we can write that, for all weak edges (i, j), 



where 



Gi{t,pi) := max i^f (t, Xc,(i), z^, p^, (t)) 



and i^f (•) is defined in ( |31a| ). Now, define 

rG/3(i) 

SO that the expressions in ( [55] ) can be re-written as 

QUjit) = Qr(t)-A,.Q^(t)A*,. 



(57) 
(58) 

(59a) 
(59b) 



(60a) 
(60b) 



Using ( [60] ), neglecting terms of order 0(||Aij|p) and taking 



the approximation that Sti^j{t) ^ Xj(t), ( [57] ) can be further 
approximated as 



/\i^j{t,^j) ^ Gi{t,pi{t) Aij{-Kj -x^-(t))) 



(61) 



Si(t) = ^G,(t,p,(t)) 

Qrw = -^Gz(t,p.(t)). 



(62a) 
(62b) 



Based on the definition of Gi(-) in ( [58] ) with i^f (•) defined 
in ( [31a| ), one can apply Lemma [T] to show that ( [62] ) agrees 
with ( [34] ). Also applying ( [62] ), we can take a second-order 
approximation of ( [6T] ) as 



A^^j(t,Xj) ^ const 



1, 



+ s,(t)*A,,(x, -x,(t)) - -||A,,(x, -x,(t))||^.(,) 
= const + [A*^-s,(t) + A^Q,^(t)A,,%(t)] * x, 

+ ix*A^Q,^(t)A,,x, (63) 



for all weak edges (i, j). 

Next consider the case when j so that (z, j) is a 

depend on Xj, so we can write 



strong edge. In this case, AJ!^^^(t, Xj, z^) in ( [53] ) does not 



.weak 



(t,x„z,) ^ const + Ar"^t,Zz) 



where 



Ar-'^(t,Zi) := max 

X : Zi=A^x 



(64) 



(65) 



with the maximization being over x such that z^ = A^x. Using 
a similar least-squares calculations as above, AJ^®^^(t, z^) is 
given by 



(66) 



and pi(t) and (t) are defined in Combining (|49]), (|64] 
and ( |66l ), we can write that, for all strong edges 



Ai^j{t,Xj) Si const 



A^!;7^(t,x„z,) 



(67) 



From ( [48a| ) and ( [50a| ), we see that ( [67] ) agrees with the factor 
node update ( [28] ) for the the strong edges. 

We now turn to the variable node update ( [Tt] ) which we 
partition as 



A,^,(t,x,) = A-f (i,x,) + A:::7^(t,x,), 



(68) 



where 



Af""*^(t+l,x,) 



J2 A^^j(i,x,) (69a) 



Substituting the approximation ([63]) into the summation (|69b|) 



k weak 



(t+l,x,-) « -i||fi^,(t) -x,||2jj.^^(,), (70) 
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where 

= x(t) + Qr^,.(t)^Al^.s,(t). (71b) 

We now again consider two cases: when (z, j) is a weak edge 
and when it is a strong edge. First consider the weak edge 
case. In this case, j a{i), so A-^*^"^(t+l, Xj) in ( |69a| ) does 
not depend on i. Combining ( [68] ) and ( |70| ), we see that 

A,^,(t+l,x,) ^i^;(t,x,,f,^,(t),Q[^^.(t)), (72) 

where i^J(-) is defined in ( [23] ). Also, comparing ( [35] ) with 
( [tT] ), we have that 



Q-^(t) « Q-'-(t) (73a) 
Vi^j{t) ^ f,(i)-QJ(t)A^Si(t). (73b) 

Substituting ( fTS) ) into (|72]i we get 

Ai^,-(t+l,x,-) 

« ffj (t, X,- , (t) - QJ {t)A*^% (t) , QJ (t) ) . (74) 

A similar set of calculations shows that Aj(t + l,Xj) in ( [TS] ) 
can be approximated as 



A,(t+l,x,)«J?f(t,x,-,f,(i),Q5(t)). 



(75) 



Thus, the definitions of Xj(t+1) and QJ(t+l) in ( [5T] > agree 
with ([24](. Also, if we let 

Tj{t,Yj) := argmaxiJJ(t,Xj,fj,QJ(i)), 
it follows from (|5l]l, (|74]i and (|75]l that 

x,(t+i) «rj(t,fj(t)) 

ar,.(t,r,(t)), 



x,.(t) 



dr-j 



'-QUt)A*Mt). (76) 



It can be shown from Lemma \T\ that 



Che] 



Q-'-(t) 



« Q-(t)Q-'-(i), 
and hence, from ( [76] ), 

x,^,(t+l) ^ x,(t+l) - Q"(t+l)A*^.s,(t). (77) 
Substituting ( [TT] ) into ( [59] ) we obtain 

p,(t)^ ^ A,,x,(t)- ^ A,,Q-(t)A^s,(t-l) 

je/3(i) iG/3(i) 

« z,(t)-Q^(t)s,(t-l), 
which agrees with the definition in ([26]). 



Appendix B 
Heuristic Derivation of Sum-Product 
Hybrid-GAMP 

A. Preliminary Lemma 

The derivation of the approximate sum-product algorithm 
is similar to the derivation in Appendix [A| for the max- 
sum algorithm. In particular, the analogue to Lemma [T] is as 
follows: 

Lemma 2: Suppose that W and V are random vectors with 
a conditional probability distribution function of the form 

Pw|v(w I v) = exp [uH{w, v)] , 



where i7(w,v) is given in ( [43] ), u > is some constant 
and Z{v) is a normalization constant (called the partition 
function). Then, 

^-x(v) = DQ-^ 



d_ 



5v 

logZ(v) = Q--(x(v)-v) 



(78a) 
(78b) 



log Z(v) 



Q"^ + Q-^DQ-^ (78c) 



where 



S(v) =E[W|V = v], D = ixvar(W|V = v). 

Proof: The relations are standard properties of exponen- 
tial families j2|. ■ 

B. Approximation for Hybrid-GAMP 

The derivation of the Hybrid-GAMP algorithm for the sum- 
product algorithm is similar to the derivation of the max- sum 
algorithm in Appendix [A] so we will just sketch the proof 
here. Similar to the max-sum derivation, we first partition 
the function Hi^j{-) in ( [T3] ) as in ( [47] ). Then, the marginal 
distribution pi^j (t, Xj) of the distribution pi^j (t, x^^) in ( (16] ) 
can be re-written as 



(X 



it^^j) = J Pi^j{t^^d{i))d^d{i)\j 
j ^f_l'';^{t, X, , zOC4?(^, ^j.z,)dz,, (79) 



where 



(X J exp [iii^^°''^(t,Xc,(i),Zi)] dxc,(i)\^- (80a) 

X«(i)\j 



(X J exp [uHJZfit, x^(,))] dx^(,)\, (80b) 

Zi=AiX 

where the integration in ( [80a| ) is over the variables x^ with 
r G a{i) ^ j, and and the integration in ( |80b| ) is over the 
variables x^ with r G /3(z) 7^ j, and = A^x. 

To approximate pi^j(t,Xj) in ( [79] ), we separately consider 
the cases when (z,j) is weak edge and when it is a strong 
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edge. We begin with the weak edge case. That is, j G /3{i). 
Let 



11 var[xj ; Aj(t, •)] 




where we have used the notation E[^(x) ; A(-)] from ([12]). 

Now, using the expression for x^(^)) in ( |48b| ), 

it can be verified that Xj, z^) is equivalent to the 

probabiHty distribution of a random variable 

A^T-X^- , 



■ E 



(82) 



with the variables x^ being independent with probability 
distribution 

p(x^) (X exp('uA^^^(x^)). 
Moreover, %^j{t) and Q^^At)/u in ([ST]) are precisely the 



mean and variance of the random variables x. under this 



distribution. Therefore, if the summation in ( |82| ) is over a large 
number of terms, we can then use the Central Limit Theorem 
to approximate the variable in in ([82]) as Gaussian, with 
distribution Xj, z^) given by 

i;T^f{t,x„Zi)^Af{A,,^j+p,^,{t),QUjit)/u), (83) 

where Pi^j{t) and Q^_^^(t) are given in ( [55] ). Applying 
this Gaussian approximation into the probability distribution 
Pi^j{t^^Q(^i^^7^i) in ( [T6] ), and then using the definitions in 
( |48a| ) and ( |8Qa| ), we obtain the approximation for the message 
in ([BJ 

A,^,(t,x,) ^ G,(t, A,,x, +p,^,(t),Q[_^^.(t)) 

where 

:= ;^log j exp [ixi^f (t,Xc,(i),Zi,pi,Q[)] (iXc,(i)C^Zi(84) 



and i^f (•) is given in ( |31a| ). 

We now define Pi(t) and Q^(t) as in ( [59] ), and similar to 
([62|) let 



Si(t) 



9p 

9p' 



,G,(t,p,(t),Qnt)). 



(85a) 
(85b) 



Using Lemma [2] one can show that the definitions in ( [85] ) agree 
with the updates ( [34] ) where z^(t) and Qf(t) are the mean 
and covariance of the random variable z^ with the distribution 
( [33] ). Using a similar approximations as in the derivation of 
the max-sum algorithm, one can then obtain the quadratic 
approximation in ( [63] ) for Ai^j(t,Xj) for all weak edges 
{id). 

Next consider the case when j so that (i, j) is a 
strong edge. In this case, V^^^^(t, Xj^Zi) does not depend on 
Xj and a similar calculation as above shows that 

C4f X,, Zi) « C'-^t, z,) := M{pi{t),Clf{t)/u), (86) 



where Pi(t) and Q^{t) are defined in ( [59] ). Substituting the 
Gaussian approximation ( [86] ) into ( [T6] ), and then using the 
definitions in ( |48a| ) and ( [80a| ), one can show that the marginal 
distribution pi^j(t,Xj) in ( [T6] ) is equal to the marginal dis- 
tribution of Xc^(^), z^) in ( [30] ). Therefore, the message 



Ai^j{t^ Xj) in ( [T5] ) can be written as ( [29] ) for all strong edges 

We now turn to the variable update steps of the sum- 
product algorithm. Since this step is identical to the max-sum 
algorithm, one can follow the derivation in Appendix \A\ to 
show that A^^j(t+l,Xj) and A^(t+l,Xj) are given by (|74|) 
and ( [75] ), respectively and rj{t) and Q^{t) are given in ( [35] ). 
Also, the definitions of Xj(t) and Qj{t) in ( |81a| ) and ( |81c| ) 
are consistent with ( [25] ). 

Finally, define 

r,(t,f,) := E [x, ; i7;(t,-,f„QJ(t-l))] , (87) 

where, again we are using the notation ( [T2] ) and Hj{-) is 
defined in ([23]). It follows from ([74]), ([75]) and ^ that 

x,(t+i)^r,(t,f,(t)) 

x,^,(t + l) ^ r,(t,r,(t) - Q^^(t)A^s,(t)) 



gr,(t,r,(t)) 



Q^(t)A*,.Si(t). (88) 



From the definition ( [87] ), Lemma [2] shows that 



(89) 



and hence, from 



x,^,(t+l) ^ x,(t+l) - Q"(t+l)A*^.s,(t). (90) 

The proof now follows identically to the derivation of the 
Hybrid- GAMP max-sum algorithm. 

Appendix C 

Derivation of Hybrid-GAMP for Group Sparsity 

This Appendix provides a brief derivation of the steps in 
Algorithm [3] based on the general Hybrid-GAMP algorithm. 
Algorithm [2| In the description of the general Hybrid-GAMP 
algorithm, we used labels i and j for the factor and variable 
nodes. However, the group sparse estimation problem has a 
large number of indices. To avoid confusion, we adopt the 
following more explicit (albeit somewhat more cumbersome) 
labeling. The variables nodes will be labeled explicitly by xj 
or ^fc. For the factor nodes, we use the labels: 

• ai for the factors p{yi\zi)', 

• bj for the factors P(xj|^^(j)); and 

• Ck for the factors P{(,k)- 

With this convention, for example, A^^^^^ (t, ^/c) represents 
the message from the variable node to the factor node bj 
when j e Gk- 

Now, in the graphical model in Fig. |4j the strong edges are 
all the edges to the right of the variables xj. That is, the strong 
edges are: 

• Between the variables xj and factors bj for all j; 

• Between the variables and factors bj for all j G Gk', 
and 
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• Between the variables and factors Ck for all k. 
The remaining edges, those between the variables xj and the 
factor nodes a^, are all weak. 

With these definitions, we can easily derive the steps in 
Algorithm [3] from the general Algorithm [2] First, note that all 
the steps from lines [T0| to [18] are simply the weak edge updates 



in Algorithm |2] specialized to the case of scalar variables. 

To understand the role of the remaining lines, first consider 
the message along the strong edge from the factor node Ck and 
the variable ^k- The factor node Ck corresponds to the prior 
P{^k) in ([36]). Since the factor is attached to only one variable 
node, the outgoing message in ([29]) for this edge reduces to 



P if & = 1, 
p if = 0, 



(91) 



where the last step follows from ( [36] ). 

Next consider the message along the strong edge from the 
variable to the factor node bj for some j G Gk- Similar to 
the case of binary LDPC codes |23|, since = or 1, it is 
convenient to work with log-likelihood ratios (LLRs). Given 
any strong edge between bj and ^/e, define the LLR, 

LLR,^,(t) := A5^.^^,(t,a = 1)-A5^.^^,(t,a = 0). (92) 

The reverse LLR, LLRj^k{t) is defined similarly. 

Since the variable node ^/c is not connected to any weak 
edges, the variable node output message in ( [2T] ) reduces to 

ieGky^j 

Therefore the LLR in ([92]) is given by 



LLR 



^,^fe(^+l) = A,,_ 

reGk^j 

-- log 



-kit) 



E 

reGky^j 



(93) 



where the last step follows from ( [9T] ). 

Next consider the message from bj to Xj. Recall that the 
factor node bj corresponds to the distribution P{xj\^^^j^), 
defined by the variable Xj in ( [37] ). Also, this factor node has 
no weak edges. Hence, it can be verified, that the message 
( [29] ), applied to the edge from the factor node bj to Xj, is 
given by 

Ab^^^^{t,Xj) = logPb^^^^{t,Xj), (94) 

where Pi^.^xjit, Xj) is the probability distribution function 

Pb^^,^it,xj)=E[P{xj\^^^j))], (95) 

and the expectation is over independent variables with 

1 



P{ik = l) = l-P{ik = 0) 



(96) 



l + exp(-LLR^-^/e(t))' 

Using the fact that the P(xj|^^(j)) is the conditional dis- 
tribution for the variable in ( [37] ), the probability distribution 
Pi).^x.{t^Xj) in ( [95] ) can be written 

P,^^,^{t,xj) = Px{xj;p = pj{t)), (97) 



where Px{x; p) is the distribution for the variable X in ( [39] ) 
and pj {t) is the probability 



p,(t) = Pr(a=0, VfcG7(i)) 

- nr ' 



•exp(LLR^-^fc(t))* 



(98) 



Now, the variable node Xj has only one strong edge - the 
connection to the factor node bj . Therefore, the log probability 
in (l22l) reduces to 



^{t,Xj) 



^I^.W-x.l^ (99) 



A^^{t^l,Xj) = Ab 



Now, as described in equations ( [94] ) and ( [97] ), Ab.^x.{t,Xj) 
is the log of the probability distribution for the variable X in 
( [39] ) with p{t) = pj{t). Hence A^^ (t+l,x^) in ([99]) must be 
the log posterior distribution for the X with the measurement 
R = r{t) in ( [4Q| . Therefore, the expectations and variances in 
( [25] ) agree with the expressions in lines [S] and [9] 

Finally, consider the message from the factor node bj to a 
variable node • The derivation for this message is similar to 
the message from bj to Xj . Specifically, it can be verified that 
the factor node message ( [29] ), applied to the strong edge from 
bj to ^fe, is given by 

A6,^a(^.^fe) = logP6,^a(t,a), (100) 
where Pfo^^^^ (t, ^/c) is the probability mass function 

= J ex.pAb^^x^{t-l,Xj)E{P{xj\^^^j^)\^k)dxjllOl) 

where the expectation is over independent variables with 
probabilities in ( [96] ). To evaluate the expectation on the 
right-hand side of ( |101| ), consider the conditional expectation 
E(P(xj|^^(j))|^/e). Since the distribution P{xj\^^^j^) corre- 
sponds to the random variable Xj in ( [37] ), 



E(p(x,-ic^(,))|a) 



1) 



if a 
if a 



(102) 



where Px{x;p) is the probabiUty distribution for the random 
variable X in ((39ll and 



pj^kit) = 1 - Pr (Ci = 0, Vi G 7(j) ^ fc) 

= 1- n . . r, 77^- (103) 



1 + exp LLRi^/e(^) 



Also, the edge from variable node Xj to the factor node bj is 
the only strong edge connected to Xj. Therefore, the variable 
node message ( [2T] ) applied to that edge reduces to 

A,^^,^{t-l,xj) = --^-L_.\xj-rj{t-l)\\ (104) 



Substituting ( |102| l and ( |104| l into ( |101| > we obtain that 



OC 



Pfi(r,(t-l);Q^(i-l),l) 



if a 



Pfl(fj(t-l);QJ(t-l),?j^fc(t) if a = 



(105) 
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where pR{r; Q^, p) is the probabihty distribution of the scalar 
random variable R in ( |4Q| with X being distributed in ([39]). 
The LLR corresponding to ( |105| ) is thus given by 

LLR,^fc(t) = logP5^.^^,(t,a = l) 
which agrees with (|4TJ. 
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